On Customizing Prosody in Speech Synthesis: Names and Addresses as a Case in Point
Abstract
This work assesses the contribution of domain-specific prosodic modelling to synthetic speech quality in a name-and-address information service. A prosodic processor analyzes the textual structure of labelled input strings and inserts markers that specify the intended prosody for the DECtalk text-to-speech synthesizer. These markers impose discourse-level prosodic organization, annotate the information structure, and adapt the speaking rate to listeners in real time. In a quantitative comparison of this domain-specific modelling with the default rules in DECtalk, the domain-specific prosody was found to reduce the transcription error rate from 14.6% to 6.4%, reduce the number of repeats requested by listeners from 2.6 to 1.1, and sound significantly easier to understand and more natural. This result demonstrates the importance of prosodic modelling in synthesis, and implies an even more important role for prosody in more complicated domains and discourse structures.

2. INTRODUCTION

Text-to-speech synthesis could profitably be used to automate or create many information services, if only it were of better quality. Unfortunately it remains too unnatural and machine-like for all but the simplest and shortest texts. It has been described as sounding monotonous, boring, mechanical, harsh, disdainful, peremptory, fuzzy, muffled, choppy, and unclear. Synthesized isolated words are relatively easy to recognize, but when they are strung together into longer passages of connected speech (phrases or sentences), the meaning is much more difficult to follow: the task is unpleasant and the effort is fatiguing [1].

This less-than-ideal quality seems paradoxical, because published evaluations of synthetic speech yield intelligibility scores that are very close to those of natural speech. For example, Greene, Logan and Pisoni [2] found that the best synthetic speech could be transcribed with 96% accuracy; the several studies that have used human speech tokens typically report intelligibility scores of 96% to 99% for natural speech. (For a review see [1].) However, segmental intelligibility does not always predict comprehension. A series of experiments [3] compared two high-end commercially available text-to-speech systems on application-like material such as news items, medical benefits information, and names and addresses. The result was that the system with the significantly higher segmental intelligibility had the lower comprehension scores.

Although there may be several possible reasons for segmental intelligibility failing to predict comprehension, the current work focuses on the single most likely cause: synthesis of prosody.

Prosody is the organization imposed onto a string of words when they are uttered as connected speech. It includes pitch, duration, pauses, tempo, rhythm, and every known aspect of articulation. When the prosody is incorrect, then at best the speech will be difficult or impossible to understand [4]; at worst, listeners will misunderstand it without being aware that they have done so.

Arguments for the importance of prosody in language abound in the literature. However, the cited examples of prosodic resolution of ambiguity are usually either anecdotal or illustrated by small sets of carefully constructed sentences. It is not clear how important prosody is in more normal, everyday texts. This brings us to the first question addressed in the current study: how much will prosody contribute to the perception of synthetic speech for non-contrived, real-world textual material?

2.1. Current Approaches to Prosody in Speech
Publication date: 1993